
<!DOCTYPE html>

<html lang="en">
  <head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="generator" content="Docutils 0.18.1: http://docutils.sourceforge.net/" />

    <title>Supplementary Exercises &#8212; Joyful Pandas 1.0 documentation</title>
<script>
  document.documentElement.dataset.mode = localStorage.getItem("mode") || "";
  document.documentElement.dataset.theme = localStorage.getItem("theme") || "light"
</script>

  <!-- Loaded before other Sphinx assets -->
  <link href="_static/styles/theme.css?digest=92025949c220c2e29695" rel="stylesheet">
<link href="_static/styles/pydata-sphinx-theme.css?digest=92025949c220c2e29695" rel="stylesheet">


  <link rel="stylesheet"
    href="_static/vendor/fontawesome/5.13.0/css/all.min.css">
  <link rel="preload" as="font" type="font/woff2" crossorigin
    href="_static/vendor/fontawesome/5.13.0/webfonts/fa-solid-900.woff2">
  <link rel="preload" as="font" type="font/woff2" crossorigin
    href="_static/vendor/fontawesome/5.13.0/webfonts/fa-brands-400.woff2">

    <link rel="stylesheet" type="text/css" href="_static/pygments.css" />
    <link rel="stylesheet" type="text/css" href="_static/plot_directive.css" />
    <link rel="stylesheet" type="text/css" href="_static/css/s4defs-roles.css" />

  <!-- Pre-loaded scripts that we'll load fully later -->
  <link rel="preload" as="script" href="_static/scripts/pydata-sphinx-theme.js?digest=92025949c220c2e29695">

    <script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script>
    <script src="_static/jquery.js"></script>
    <script src="_static/underscore.js"></script>
    <script src="_static/_sphinx_javascript_frameworks_compat.js"></script>
    <script src="_static/doctools.js"></script>
    <script async="async" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
    <link rel="index" title="Index" href="genindex.html" />
    <link rel="search" title="Search" href="search.html" />
    <link rel="prev" title="pandas数据处理与分析" href="pandas%E6%95%B0%E6%8D%AE%E5%A4%84%E7%90%86%E4%B8%8E%E5%88%86%E6%9E%90.html" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<meta name="docsearch:language" content="en">
  </head>
  
  
  <body data-spy="scroll" data-target="#bd-toc-nav" data-offset="180" data-default-mode="">
    <div class="bd-header-announcement container-fluid" id="banner">
      

    </div>

    
    <nav class="bd-header navbar navbar-light navbar-expand-lg bg-light fixed-top bd-navbar" id="navbar-main"><div class="bd-header__inner container-xl">

  <div id="navbar-start">
    
    
  


<a class="navbar-brand logo" href="index.html">
  
  
  
  
    <img src="_static/finallogo1.svg" class="logo__image only-light" alt="Logo image">
    <img src="_static/finallogo1.svg" class="logo__image only-dark" alt="Logo image">
  
  
</a>
    
  </div>

  <button class="navbar-toggler" type="button" data-toggle="collapse" data-target="#navbar-collapsible" aria-controls="navbar-collapsible" aria-expanded="false" aria-label="Toggle navigation">
    <span class="fas fa-bars"></span>
  </button>

  
  <div id="navbar-collapsible" class="col-lg-9 collapse navbar-collapse">
    <div id="navbar-center" class="mr-auto">
      
      <div class="navbar-center-item">
        <ul id="navbar-main-elements" class="navbar-nav">
    <li class="toctree-l1 nav-item">
 <a class="reference internal nav-link" href="Home.html">
  Home
 </a>
</li>

<li class="toctree-l1 nav-item">
 <a class="reference internal nav-link" href="Content/index.html">
  Content
 </a>
</li>

<li class="toctree-l1 nav-item">
 <a class="reference internal nav-link" href="Author.html">
  Author
 </a>
</li>

<li class="toctree-l1 nav-item">
 <a class="reference internal nav-link" href="Datawhale.html">
  Datawhale
 </a>
</li>

<li class="toctree-l1 nav-item">
 <a class="reference internal nav-link" href="pandas%E6%95%B0%E6%8D%AE%E5%A4%84%E7%90%86%E4%B8%8E%E5%88%86%E6%9E%90.html">
  pandas数据处理与分析
 </a>
</li>

<li class="toctree-l1 current active nav-item">
 <a class="current reference internal nav-link" href="#">
  Supplementary Exercises
 </a>
</li>

    
    <li class="nav-item">
        <a class="nav-link nav-external" href="https://pandas.pydata.org/docs/index.html">Doc<i class="fas fa-external-link-alt"></i></a>
    </li>
    
</ul>
      </div>
      
    </div>

    <div id="navbar-end">
      
      <div class="navbar-end-item">
        <span id="theme-switch" class="btn btn-sm btn-outline-primary navbar-btn rounded-circle">
    <a class="theme-switch" data-mode="light"><i class="fas fa-sun"></i></a>
    <a class="theme-switch" data-mode="dark"><i class="far fa-moon"></i></a>
    <a class="theme-switch" data-mode="auto"><i class="fas fa-adjust"></i></a>
</span>
      </div>
      
      <div class="navbar-end-item">
        <ul id="navbar-icon-links" class="navbar-nav" aria-label="Icon Links">
        <li class="nav-item">
          <a class="nav-link" href="https://github.com/datawhalechina/joyful-pandas" rel="noopener" target="_blank" title="GitHub"><span><i class="fab fa-github-square"></i></span>
            <label class="sr-only">GitHub</label></a>
        </li>
      </ul>
      </div>
      
    </div>
  </div>
</div>
    </nav>
    

    <div class="bd-container container-xl">
      <div class="bd-container__inner row">
          

<!-- Only show if we have sidebars configured, else just a small margin  -->
<div class="bd-sidebar-primary col-12 col-md-3 bd-sidebar">
  <div class="sidebar-start-items"><form class="bd-search d-flex align-items-center" action="search.html" method="get">
  <i class="icon fas fa-search"></i>
  <input type="search" class="form-control" name="q" id="search-input" placeholder="Search the docs ..." aria-label="Search the docs ..." autocomplete="off" >
</form><nav class="bd-links" id="bd-docs-nav" aria-label="Main navigation">
  <div class="bd-toc-item active">
    
  </div>
</nav>
  </div>
  <div class="sidebar-end-items">
  </div>
</div>


          


<div class="bd-sidebar-secondary d-none d-xl-block col-xl-2 bd-toc">
  
    
    <div class="toc-item">
      
<div class="tocsection onthispage mt-5 pt-1 pb-3">
    <i class="fas fa-list"></i> On this page
</div>

<nav id="bd-toc-nav">
    <ul class="visible nav section-nav flex-column">
 <li class="toc-h2 nav-item toc-entry">
  <a class="reference internal nav-link" href="#ex1-numpy">
   Ex1: Vectorized Operations in NumPy
  </a>
 </li>
 <li class="toc-h2 nav-item toc-entry">
  <a class="reference internal nav-link" href="#ex2">
   Ex2: Summarizing Student Grades
  </a>
 </li>
 <li class="toc-h2 nav-item toc-entry">
  <a class="reference internal nav-link" href="#ex3">
   Ex3: Summarizing Product Reviews
  </a>
 </li>
 <li class="toc-h2 nav-item toc-entry">
  <a class="reference internal nav-link" href="#ex4">
   Ex4: Removing Matching Rows
  </a>
 </li>
 <li class="toc-h2 nav-item toc-entry">
  <a class="reference internal nav-link" href="#ex5">
   Ex5: Counting the Courses Offered in Each School District
  </a>
 </li>
 <li class="toc-h2 nav-item toc-entry">
  <a class="reference internal nav-link" href="#ex6">
   Ex6: Capturing the Row and Column Indices of Nonzero Entries
  </a>
 </li>
 <li class="toc-h2 nav-item toc-entry">
  <a class="reference internal nav-link" href="#ex7">
   Ex7: Analyzing Cluster Logs
  </a>
 </li>
</ul>

</nav>
    </div>
    
    <div class="toc-item">
      
    </div>
    
  
</div>


          
          
          <div class="bd-content col-12 col-md-9 col-xl-7">
              
              <article class="bd-article" role="main">
                
  <section id="id1">
<h1>Supplementary Exercises<a class="headerlink" href="#id1" title="Permalink to this heading">#</a></h1>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [1]: </span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>

<span class="gp">In [2]: </span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>

<span class="gp">In [3]: </span><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="nn">plt</span>
</pre></div>
</div>
<div class="caution admonition">
<p class="admonition-title">Caution</p>
<blockquote>
<div><p>In these supplementary exercises, avoid using any for or while loops wherever possible.</p>
</div></blockquote>
</div>
<section id="ex1-numpy">
<h2>Ex1: Vectorized Operations in NumPy<a class="headerlink" href="#ex1-numpy" title="Permalink to this heading">#</a></h2>
<ul class="simple">
<li><p>Given a list of positive integers, find the smallest positive integer that is missing from it.</p></li>
</ul>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">arr</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">get_miss</span><span class="p">(</span><span class="n">arr</span><span class="p">)</span>
<span class="go">1</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">arr</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">6</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">get_miss</span><span class="p">(</span><span class="n">arr</span><span class="p">)</span>
<span class="go">4</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">arr</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">5</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">get_miss</span><span class="p">(</span><span class="n">arr</span><span class="p">)</span>
<span class="go">6</span>
</pre></div>
</div>
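<p>One possible vectorized approach to the exercise above, offered as a sketch rather than a reference solution: an array of length m can cover at most m distinct positive integers, so the answer always lies in 1, …, m+1. The helper below tests those candidates with np.isin and returns the first one that is absent. Only the name get_miss comes from the exercise; the body is an assumption.</p>

```python
import numpy as np


def get_miss(arr):
    """Return the smallest positive integer that does not occur in arr."""
    arr = np.asarray(arr)
    # The answer must lie in 1..len(arr)+1, so only those candidates are tested.
    candidates = np.arange(1, arr.size + 2)
    # Keep the candidates that are absent from arr; they stay in ascending order.
    missing = candidates[~np.isin(candidates, arr)]
    return int(missing[0])
```

<p>No explicit loop is needed: the membership test and the boolean indexing are both array operations.</p>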
<ul class="simple">
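<li><p>Design a function get_res() that generates a two-dimensional NumPy array. Its input is a positive integer n, and the returned array is built as follows: row 1 contains a single 1; row 2 contains two consecutive 2s, starting in the column immediately after the one filled in row 1; row 3 contains three consecutive 3s, starting in the column immediately after the last one filled in row 2; …; row n contains n consecutive n's, starting in the column immediately after the last one filled in row n-1. As the example shows, all other entries are 0.</p></li>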
</ul>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">n</span> <span class="o">=</span> <span class="mi">4</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">get_res</span><span class="p">(</span><span class="n">n</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="n">array</span><span class="p">([[</span><span class="mf">1.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">],</span>
       <span class="p">[</span><span class="mf">0.</span><span class="p">,</span> <span class="mf">2.</span><span class="p">,</span> <span class="mf">2.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">],</span>
       <span class="p">[</span><span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">3.</span><span class="p">,</span> <span class="mf">3.</span><span class="p">,</span> <span class="mf">3.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">],</span>
       <span class="p">[</span><span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">0.</span><span class="p">,</span> <span class="mf">4.</span><span class="p">,</span> <span class="mf">4.</span><span class="p">,</span> <span class="mf">4.</span><span class="p">,</span> <span class="mf">4.</span><span class="p">]])</span>
</pre></div>
</div>
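<p>The construction above can be sketched with np.repeat and broadcasting. The idea, which is an assumption about one workable approach and not the author's solution, is to label every column with the row that owns it and then compare row labels against column labels.</p>

```python
import numpy as np


def get_res(n):
    """Build the staircase array described in the exercise for a positive integer n."""
    rows = np.arange(1, n + 1)
    # Column j is owned by row r if it falls in r's block: [1, 2, 2, 3, 3, 3, ...].
    col_owner = np.repeat(rows, rows)
    # mask[i, j] is True exactly when column j belongs to row i + 1.
    mask = rows[:, None] == col_owner[None, :]
    # Multiplying by the float owner labels writes r into row r's block and 0 elsewhere.
    return mask * col_owner.astype(float)
```

<p>The result has n rows and n(n+1)/2 columns, matching the 4 by 10 array shown for n = 4.</p>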
<ul class="simple">
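<li><p>A point <span class="math notranslate nohighlight">\(A\)</span> starts at the origin of the number line and performs an n-step simple random walk: at each step it moves a distance of 1 to the left or to the right with equal probability. Let <span class="math notranslate nohighlight">\(S_n\)</span> denote its final position. It can then be shown that</p></li>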
</ul>
<div class="math notranslate nohighlight">
\[\lim_{n\rightarrow+\infty}\frac{\mathbb{E}|S_n|}{\sqrt{n}}=\sqrt{\frac{2}{\pi}}\]</div>
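<p>Now take n = 5000 and run 1000 experiments. In each experiment, <span class="math notranslate nohighlight">\(\frac{1}{100}\sum_{k=1}^{100}|S_n^{(k)}|\)</span> , the average over 100 independent walks <span class="math notranslate nohighlight">\(S_n^{(1)},\dots,S_n^{(100)}\)</span> , is used in place of <span class="math notranslate nohighlight">\(\mathbb{E}|S_n|\)</span> . This yields 1000 estimates of <span class="math notranslate nohighlight">\(\frac{\mathbb{E}|S_n|}{\sqrt{n}}-\sqrt{\frac{2}{\pi}}\)</span> , a difference that vanishes as n tends to infinity. Compute the mean, the 0.05 quantile and the 0.95 quantile of these estimates.</p>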
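<p>A minimal simulation sketch for this exercise follows. It swaps step-by-step simulation for an equivalent shortcut: if R of the n steps go right, the final position is 2R - n, and R follows a Binomial(n, 1/2) distribution, so the final positions can be drawn directly. The seed and all variable names are assumptions chosen for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed only so the sketch is reproducible
n, n_trials, n_walks = 5000, 1000, 100

# Number of right-moving steps in each walk; the remaining n - R steps go left.
right_steps = rng.binomial(n, 0.5, size=(n_trials, n_walks))
final_pos = 2 * right_steps - n

# One estimate per trial: average |S_n| over the 100 walks, scale, subtract the limit.
estimates = np.abs(final_pos).mean(axis=1) / np.sqrt(n) - np.sqrt(2 / np.pi)

mean_est = estimates.mean()
q05, q95 = np.quantile(estimates, [0.05, 0.95])
```

<p>Simulating every step with an array of shape (1000, 100, 5000) is also possible but needs far more memory; the binomial shortcut draws from the same distribution for the final position.</p>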
<ul class="simple">
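<li><p>There are n points in the two-dimensional plane, each with a k-dimensional feature vector. The point coordinates are stored in node_xy and the point features in node_fea. Compute the correlation matrix <span class="math notranslate nohighlight">\(S\)</span> of all points, where the correlation coefficient between point a and point b is defined as</p></li>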
</ul>
<div class="math notranslate nohighlight">
\[S_{ab} = \frac{\sigma_{ab}}{2} + \frac{\lambda_{ab}}{2}\]</div>
<p>Here, writing <span class="math notranslate nohighlight">\(A\)</span> for the features of point a and <span class="math notranslate nohighlight">\(B\)</span> for the features of point b, we have</p>
<div class="math notranslate nohighlight">
\[\sigma_{ab} = \frac{\sum_{i=1}^kA_iB_i}{\sqrt{\sum_{i=1}^kA^2_i}\sqrt{\sum_{i=1}^kB^2_i}}\]</div>
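<p>For point a, sort the planar distances from all points to a to obtain each point's distance rank with respect to a. The nearest point (rank 1) is a itself. Writing <span class="math notranslate nohighlight">\(r^{(a)}_b\)</span> for the rank of point b, define</p>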
<div class="math notranslate nohighlight">
\[\lambda_{ab} = 1 - \frac{2\times (r^{(a)}_b-1)}{n-1}\]</div>
<p>Compute the correlation matrix <span class="math notranslate nohighlight">\(S\)</span> for the given node_xy and node_fea. (Hint: use np.argsort().)</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="n">n</span><span class="p">,</span> <span class="n">k</span> <span class="o">=</span> <span class="mi">1000</span><span class="p">,</span> <span class="mi">10</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">node_xy</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">node_fea</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">k</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">get_S</span><span class="p">(</span><span class="n">node_xy</span><span class="p">,</span> <span class="n">node_fea</span><span class="p">)</span>
</pre></div>
</div>
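<p>The definitions above can be sketched as follows. This is one possible implementation under stated assumptions, not a reference solution: it assumes no two points share the same coordinates (so ranks have no ties) and that no feature vector is all zeros. Applying argsort twice along a row turns distances into 0-based ranks, which equal <span class="math notranslate nohighlight">\(r^{(a)}_b-1\)</span> .</p>

```python
import numpy as np


def get_S(node_xy, node_fea):
    """Combine feature cosine similarity and distance-rank similarity into S."""
    n = node_xy.shape[0]

    # sigma: cosine similarity between every pair of feature vectors.
    norms = np.sqrt((node_fea ** 2).sum(axis=1))
    sigma = node_fea @ node_fea.T / (norms[:, None] * norms[None, :])

    # Squared pairwise distances; squaring preserves the ordering of distances.
    diff = node_xy[:, None, :] - node_xy[None, :, :]
    dist2 = (diff ** 2).sum(axis=2)

    # argsort of argsort gives, in row a, the 0-based rank of each point b.
    rank0 = dist2.argsort(axis=1).argsort(axis=1)
    lam = 1 - 2 * rank0 / (n - 1)

    return sigma / 2 + lam / 2
```

<p>Note that the resulting matrix is generally not symmetric, because the rank of b as seen from a need not equal the rank of a as seen from b.</p>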
</section>
<section id="ex2">
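<h2>Ex2: Summarizing Student Grades<a class="headerlink" href="#ex2" title="Permalink to this heading">#</a></h2>
<p>The directory data/supplement/ex2 holds the first-semester grades of a school's final-year senior high students: 16 weekly tests, the midterm exam and the final exam. The 科目 (subject) column gives each student's score in their elective subject. Across all tables, the same row refers to the same student. Complete the following exercises:</p>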
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [4]: </span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;data/supplement/ex2/第1次周测成绩.csv&#39;</span><span class="p">)</span>

<span class="gp">In [5]: </span><span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
<span class="gh">Out[5]: </span>
<span class="go">   班级   姓名  选科   语文  数学   英语  科目</span>
<span class="go">0   1   吴刚  地理   93  95   82  69</span>
<span class="go">1   1   卢楠  物理  108  77   90  94</span>
<span class="go">2   1  唐秀兰  历史   88  72   95  85</span>
<span class="go">3   1   张刚  化学   85  88  102  76</span>
<span class="go">4   1   姜洋  历史  104  99   84  86</span>
</pre></div>
</div>
<ul class="simple">
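<li><p>Are there any students in this grade who share the same name?</p></li>
<li><p>For the first weekly test, compute for each class the mean total score in Chinese, mathematics and English (语文, 数学, 英语) among the students whose elective (选科) is physics (物理) or chemistry (化学). Which class has the highest mean?</p></li>
<li><p>A student's overall score for the semester is the weighted average of the total scores of all exams: the weekly tests carry 50% of the weight (split equally, i.e. 3.125% each), the midterm 20% and the final 30%. Use the nlargest function to find the ten students with the highest overall scores in the grade.</p></li>
<li><p>For classes 1 to 8, find the top 5 students by final-exam total score in the science track (physics, chemistry, biology) and in the arts track (politics, history, geography). The result should be formatted as follows, with the elective score in parentheses:</p></li>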
</ul>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [6]: </span><span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span>
<span class="gp">   ...: </span>    <span class="p">{</span>
<span class="gp">   ...: </span>        <span class="s2">&quot;1班（文）&quot;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&quot;王大锤：历史（102）&quot;</span><span class="p">]</span><span class="o">+</span><span class="p">[</span><span class="s2">&quot;...&quot;</span><span class="p">]</span><span class="o">*</span> <span class="mi">4</span><span class="p">,</span>
<span class="gp">   ...: </span>        <span class="s2">&quot;1班（理）&quot;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&quot;...&quot;</span><span class="p">]</span><span class="o">*</span> <span class="mi">5</span><span class="p">,</span>
<span class="gp">   ...: </span>        <span class="s2">&quot;2班（文）&quot;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&quot;...&quot;</span><span class="p">]</span><span class="o">*</span> <span class="mi">5</span><span class="p">,</span>
<span class="gp">   ...: </span>        <span class="s2">&quot;...&quot;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&quot;...&quot;</span><span class="p">]</span><span class="o">*</span> <span class="mi">5</span><span class="p">,</span>
<span class="gp">   ...: </span>        <span class="s2">&quot;8班（理）&quot;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&quot;...&quot;</span><span class="p">]</span><span class="o">*</span> <span class="mi">5</span><span class="p">,</span>
<span class="gp">   ...: </span>    <span class="p">}</span>
<span class="gp">   ...: </span><span class="p">)</span> <span class="c1"># 王大锤：历史（102） is only an example; the result strings must follow this format</span>
<span class="gp">   ...: </span>
<span class="gh">Out[6]: </span>
<span class="go">         1班（文） 1班（理） 2班（文）  ... 8班（理）</span>
<span class="go">0  王大锤：历史（102）   ...   ...  ...   ...</span>
<span class="go">1          ...   ...   ...  ...   ...</span>
<span class="go">2          ...   ...   ...  ...   ...</span>
<span class="go">3          ...   ...   ...  ...   ...</span>
<span class="go">4          ...   ...   ...  ...   ...</span>
</pre></div>
</div>
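<p>As an illustration of the second question in the list above (students taking 物理 or 化学 in the first weekly test), here is a minimal sketch on a hypothetical miniature table. The column names follow the df.head() output shown earlier; the rows and every value in df_demo are invented for illustration and do not come from the real data.</p>

```python
import pandas as pd

# Hypothetical stand-in for the first weekly test table (invented values).
df_demo = pd.DataFrame({
    "班级": [1, 1, 1, 2, 2, 2],
    "姓名": ["a", "b", "c", "d", "e", "f"],
    "选科": ["物理", "地理", "化学", "物理", "历史", "化学"],
    "语文": [100, 90, 110, 95, 80, 105],
    "数学": [120, 100, 90, 100, 70, 95],
    "英语": [110, 95, 100, 105, 85, 100],
})

# Keep only the students whose elective is physics or chemistry.
selected = df_demo[df_demo["选科"].isin(["物理", "化学"])]
# Row-wise total of the three core subjects.
total = selected[["语文", "数学", "英语"]].sum(axis=1)
# Mean total per class, then the class with the highest mean.
class_mean = total.groupby(selected["班级"]).mean()
best_class = class_mean.idxmax()
```

<p>With the real file, df_demo would be replaced by the table read with pd.read_csv; the filtering and grouping steps stay the same.</p>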
<ul class="simple">
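<li><p>A student's stability can be measured by the standard deviation, across exams, of their total-score rank among all students in the grade who take the same elective. Compute the mean of this stability measure for each class and each elective. The result should be formatted as follows:</p></li>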
</ul>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [7]: </span><span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span>
<span class="gp">   ...: </span>    <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">11</span><span class="p">,</span> <span class="mi">6</span><span class="p">),</span>
<span class="gp">   ...: </span>    <span class="n">index</span><span class="o">=</span><span class="n">pd</span><span class="o">.</span><span class="n">Index</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">12</span><span class="p">),</span> <span class="n">name</span><span class="o">=</span><span class="s2">&quot;班级&quot;</span><span class="p">),</span>
<span class="gp">   ...: </span>    <span class="n">columns</span><span class="o">=</span><span class="n">pd</span><span class="o">.</span><span class="n">Index</span><span class="p">(</span>
<span class="gp">   ...: </span>        <span class="p">[</span><span class="s2">&quot;物理&quot;</span><span class="p">,</span> <span class="s2">&quot;化学&quot;</span><span class="p">,</span> <span class="s2">&quot;生物&quot;</span><span class="p">,</span> <span class="s2">&quot;历史&quot;</span><span class="p">,</span> <span class="s2">&quot;地理&quot;</span><span class="p">,</span> <span class="s2">&quot;政治&quot;</span><span class="p">],</span>
<span class="gp">   ...: </span>        <span class="n">name</span><span class="o">=</span><span class="s2">&quot;选科&quot;</span><span class="p">,</span>
<span class="gp">   ...: </span>    <span class="p">)</span>
<span class="gp">   ...: </span><span class="p">)</span>
<span class="gp">   ...: </span>
<span class="gh">Out[7]: </span>
<span class="go">选科        物理        化学        生物        历史        地理        政治</span>
<span class="go">班级                                                            </span>
<span class="go">1   0.461479  0.780529  0.118274  0.639921  0.143353  0.944669</span>
<span class="go">2   0.521848  0.414662  0.264556  0.774234  0.456150  0.568434</span>
<span class="go">3   0.018790  0.617635  0.612096  0.616934  0.943748  0.681820</span>
<span class="go">4   0.359508  0.437032  0.697631  0.060225  0.666767  0.670638</span>
<span class="go">5   0.210383  0.128926  0.315428  0.363711  0.570197  0.438602</span>
<span class="go">6   0.988374  0.102045  0.208877  0.161310  0.653108  0.253292</span>
<span class="go">7   0.466311  0.244426  0.158970  0.110375  0.656330  0.138183</span>
<span class="go">8   0.196582  0.368725  0.820993  0.097101  0.837945  0.096098</span>
<span class="go">9   0.976459  0.468651  0.976761  0.604846  0.739264  0.039188</span>
<span class="go">10  0.282807  0.120197  0.296140  0.118728  0.317983  0.414263</span>
<span class="go">11  0.064147  0.692472  0.566601  0.265389  0.523248  0.093941</span>
</pre></div>
</div>
</section>
<section id="ex3">
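<h2>Ex3: Summarizing Product Reviews<a class="headerlink" href="#ex3" title="Permalink to this heading">#</a></h2>
<p>The directory data/supplement/ex3 contains two tables about product reviews. “商品信息.csv” records each product's ID number (ID号), its unique identification code (识别码) and its category (类别). “申请与审核记录.csv” records the review history of each product. The review process works as follows: an applicant submits a review request for a product, and a reviewer then reviews it. The outcome is either pass (通过) or fail (未通过). If the product fails, another applicant may submit a new request, and this repeats until the product passes.</p>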
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [8]: </span><span class="n">df_info</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;data/supplement/ex3/商品信息.csv&#39;</span><span class="p">)</span>

<span class="gp">In [9]: </span><span class="n">df_info</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
<span class="gh">Out[9]: </span>
<span class="go">         ID号      识别码  类别</span>
<span class="go">0  ID 000001  CRtXJUK  T1</span>
<span class="go">1  ID 000002  RGSxifC  Q1</span>
<span class="go">2  ID 000003  AboduTp  S1</span>
<span class="go">3  ID 000004  zlpUeMl  S2</span>
<span class="go">4  ID 000005  IVQqhIK  S3</span>

<span class="gp">In [10]: </span><span class="n">df_record</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;data/supplement/ex3/申请与审核记录.csv&#39;</span><span class="p">)</span>

<span class="gp">In [11]: </span><span class="n">df_record</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
<span class="gh">Out[11]: </span>
<span class="go">         ID号         申请人        申请时间         审核人        审核时间   结果</span>
<span class="go">0  ID 000001  \#+3((52\{  2020-04-19  ~1=6\*183|  2020-05-03  未通过</span>
<span class="go">1  ID 000001  8@75[1|2\*  2020-05-10  15![3\({59  2020-07-17  未通过</span>
<span class="go">2  ID 000001  }!7)(#^0*7  2020-07-28  3`}04}%@75  2020-08-23   通过</span>
<span class="go">3  ID 000002  |*{20#9|}5  2020-01-05  ={`8]03*4+  2020-03-09  未通过</span>
<span class="go">4  ID 000002  4~6%)455`[  2020-03-14  =$-36[)|8]  2020-04-21  未通过</span>
</pre></div>
</div>
<ul class="simple">
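<li><p>How many products eventually passed review?</p></li>
<li><p>What is the pass rate for each product category?</p></li>
<li><p>For products in category “T1” whose final status is pass, what is the average number of reviews?</p></li>
<li><p>Is there any product for which a new review request was submitted before the previous review had finished?</p></li>
<li><p>For every product that passed review, report the first applicant and the last reviewer, in the following format:</p></li>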
</ul>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [12]: </span><span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span>
<span class="gp">   ....: </span>    <span class="p">{</span>
<span class="gp">   ....: </span>        <span class="s2">&quot;ID号&quot;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&quot;ID 000001&quot;</span><span class="p">]</span><span class="o">+</span><span class="p">[</span><span class="s2">&quot;...&quot;</span><span class="p">]</span><span class="o">*</span><span class="mi">3</span><span class="p">,</span>
<span class="gp">   ....: </span>        <span class="s2">&quot;类别&quot;</span><span class="p">:[</span><span class="s2">&quot;T1&quot;</span><span class="p">]</span><span class="o">+</span><span class="p">[</span><span class="s2">&quot;...&quot;</span><span class="p">]</span><span class="o">*</span><span class="mi">3</span><span class="p">,</span>
<span class="gp">   ....: </span>        <span class="s2">&quot;申请人&quot;</span><span class="p">:[</span><span class="s2">&quot;\#+3((52\{&quot;</span><span class="p">]</span><span class="o">+</span><span class="p">[</span><span class="s2">&quot;...&quot;</span><span class="p">]</span><span class="o">*</span><span class="mi">3</span><span class="p">,</span>
<span class="gp">   ....: </span>        <span class="s2">&quot;审核人&quot;</span><span class="p">:[</span><span class="s2">&quot;3`}04}%@75&quot;</span><span class="p">]</span><span class="o">+</span><span class="p">[</span><span class="s2">&quot;...&quot;</span><span class="p">]</span><span class="o">*</span><span class="mi">3</span>
<span class="gp">   ....: </span>    <span class="p">},</span>
<span class="gp">   ....: </span>    <span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="s2">&quot;...&quot;</span><span class="p">]</span>
<span class="gp">   ....: </span><span class="p">)</span>
<span class="gp">   ....: </span>
<span class="gh">Out[12]: </span>
<span class="go">           ID号   类别         申请人         审核人</span>
<span class="go">1    ID 000001   T1  \#+3((52\{  3`}04}%@75</span>
<span class="go">2          ...  ...         ...         ...</span>
<span class="go">3          ...  ...         ...         ...</span>
<span class="go">...        ...  ...         ...         ...</span>
</pre></div>
</div>
<div class="hint admonition">
<p class="admonition-title">Hint</p>
<blockquote>
<div><p>The head and tail methods are also defined on groupby objects.</p>
</div></blockquote>
</div>
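<p>The hint can be illustrated on a small hypothetical table. The sketch below assumes that the records of each product are already in chronological order, so that head(1) picks the first application and tail(1) the last review. The column names follow the tables above; the names rec, info and res and all cell values are invented for illustration.</p>

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (invented values).
rec = pd.DataFrame({
    "ID号": ["ID 000001"] * 3 + ["ID 000002"] * 2,
    "申请人": ["a1", "a2", "a3", "b1", "b2"],
    "审核人": ["r1", "r2", "r3", "s1", "s2"],
    "结果": ["未通过", "未通过", "通过", "未通过", "未通过"],
})
info = pd.DataFrame({"ID号": ["ID 000001", "ID 000002"], "类别": ["T1", "Q1"]})

# Products with at least one passing record.
passed_ids = rec.loc[rec["结果"] == "通过", "ID号"].unique()
grouped = rec[rec["ID号"].isin(passed_ids)].groupby("ID号")

# head(1) / tail(1) keep the first / last row of every group.
first_applicant = grouped.head(1)[["ID号", "申请人"]]
last_reviewer = grouped.tail(1)[["ID号", "审核人"]]

res = info.merge(first_applicant, on="ID号").merge(last_reviewer, on="ID号")
res.index = range(1, len(res) + 1)  # 1-based index as in the requested format
```

<p>If the real records are not sorted by time, sorting by 申请时间 within each product before grouping would be needed.</p>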
</section>
<section id="ex4">
<h2>Ex4: Removing Matching Rows<a class="headerlink" href="#ex4" title="Permalink to this heading">#</a></h2>
<p>Given the two tables below, remove from df1 every row that appears in df2.</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [13]: </span><span class="n">df1</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span>
<span class="gp">   ....: </span>    <span class="s2">&quot;A&quot;</span><span class="p">:</span> <span class="p">[</span><span class="mi">3</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">3</span><span class="p">],</span>
<span class="gp">   ....: </span>    <span class="s2">&quot;B&quot;</span><span class="p">:</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">6</span><span class="p">,</span><span class="mi">2</span><span class="p">],</span>
<span class="gp">   ....: </span>    <span class="s2">&quot;C&quot;</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">7</span><span class="p">,</span><span class="mi">7</span><span class="p">,</span><span class="mi">1</span><span class="p">],</span>
<span class="gp">   ....: </span>    <span class="s2">&quot;D&quot;</span><span class="p">:</span> <span class="p">[</span><span class="mi">5</span><span class="p">,</span><span class="mi">6</span><span class="p">,</span><span class="mi">6</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">5</span><span class="p">],</span>
<span class="gp">   ....: </span><span class="p">})</span>
<span class="gp">   ....: </span>

<span class="gp">In [14]: </span><span class="n">df1</span>
<span class="gh">Out[14]: </span>
<span class="go">   A  B  C  D</span>
<span class="go">0  3  2  1  5</span>
<span class="go">1  2  1  2  6</span>
<span class="go">2  2  1  2  6</span>
<span class="go">3  3  3  7  1</span>
<span class="go">4  1  6  7  2</span>
<span class="go">5  3  2  1  5</span>

<span class="gp">In [15]: </span><span class="n">df2</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span>
<span class="gp">   ....: </span>    <span class="s2">&quot;A&quot;</span><span class="p">:</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">1</span><span class="p">],</span>
<span class="gp">   ....: </span>    <span class="s2">&quot;B&quot;</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">9</span><span class="p">,</span><span class="mi">6</span><span class="p">],</span>
<span class="gp">   ....: </span>    <span class="s2">&quot;C&quot;</span><span class="p">:</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span><span class="mi">7</span><span class="p">,</span><span class="mi">7</span><span class="p">],</span>
<span class="gp">   ....: </span>    <span class="s2">&quot;D&quot;</span><span class="p">:</span> <span class="p">[</span><span class="mi">6</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">],</span>
<span class="gp">   ....: </span><span class="p">})</span>
<span class="gp">   ....: </span>

<span class="gp">In [16]: </span><span class="n">df2</span>
<span class="gh">Out[16]: </span>
<span class="go">   A  B  C  D</span>
<span class="go">0  2  1  2  6</span>
<span class="go">1  3  9  7  1</span>
<span class="go">2  1  6  7  2</span>
</pre></div>
</div>
<p>The result should be as follows:</p>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [17]: </span><span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span>
<span class="gp">   ....: </span>    <span class="s2">&quot;A&quot;</span><span class="p">:</span> <span class="p">[</span><span class="mi">3</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">3</span><span class="p">],</span>
<span class="gp">   ....: </span>    <span class="s2">&quot;B&quot;</span><span class="p">:</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">2</span><span class="p">],</span>
<span class="gp">   ....: </span>    <span class="s2">&quot;C&quot;</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">7</span><span class="p">,</span><span class="mi">1</span><span class="p">],</span>
<span class="gp">   ....: </span>    <span class="s2">&quot;D&quot;</span><span class="p">:</span> <span class="p">[</span><span class="mi">5</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">5</span><span class="p">],</span>
<span class="gp">   ....: </span><span class="p">})</span>
<span class="gp">   ....: </span>
<span class="gh">Out[17]: </span>
<span class="go">   A  B  C  D</span>
<span class="go">0  3  2  1  5</span>
<span class="go">1  3  3  7  1</span>
<span class="go">2  3  2  1  5</span>
</pre></div>
</div>
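<p>Judging from the expected output, the task appears to be keeping the rows of df1 that do not occur in df2. A minimal sketch under that assumption uses a left merge with an indicator column (an anti-join); the name <code>merged</code> is only illustrative.</p>

```python
import pandas as pd

df1 = pd.DataFrame({
    "A": [3, 2, 2, 3, 1, 3],
    "B": [2, 1, 1, 3, 6, 2],
    "C": [1, 2, 2, 7, 7, 1],
    "D": [5, 6, 6, 1, 2, 5],
})
df2 = pd.DataFrame({
    "A": [2, 3, 1],
    "B": [1, 9, 6],
    "C": [2, 7, 7],
    "D": [6, 1, 2],
})

cols = list(df1.columns)
# Deduplicate df2 so the left merge cannot multiply rows of df1.
merged = df1.merge(df2.drop_duplicates(), how="left", on=cols, indicator=True)
# "left_only" marks rows of df1 that found no full-row match in df2.
res = merged.loc[merged["_merge"] == "left_only", cols].reset_index(drop=True)
print(res)
```

<p>A left merge keeps the row order of df1, so the surviving rows appear in their original order.</p>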
</section>
<section id="ex5">
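<h2>Ex5: Count the Courses Offered in Each School District<a class="headerlink" href="#ex5" title="Permalink to this heading">#</a></h2>
<p>A city has 4 school districts, each containing several schools, and no two schools share a name. Each record lists courses offered by one school. A school may have several records; courses never repeat within a single record, but they may repeat across different records of the same school.</p>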
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [18]: </span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">&#39;data/supplement/ex5/school_course.csv&#39;</span><span class="p">)</span>

<span class="gp">In [19]: </span><span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
<span class="gh">Out[19]: </span>
<span class="go">     Area      School                         Course</span>
<span class="go">0  area_1   school_99                      course_90</span>
<span class="go">1  area_2   school_32                      course_20</span>
<span class="go">2  area_3   school_64                      course_38</span>
<span class="go">3  area_1  school_231  course_9 course_40 course_100</span>
<span class="go">4  area_3  school_147  course_57 course_77 course_28</span>
</pre></div>
</div>
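<p>There are 100 courses in total, numbered "course_1" through "course_100". For each district, count the number of schools that offer each course. The result should have the following format:</p>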
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [20]: </span><span class="n">res</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span>
<span class="gp">   ....: </span>    <span class="mi">0</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s2">&quot;course_</span><span class="si">%d</span><span class="s2">&quot;</span><span class="o">%</span><span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">)],</span>
<span class="gp">   ....: </span>    <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s2">&quot;area_</span><span class="si">%d</span><span class="s2">&quot;</span><span class="o">%</span><span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">4</span><span class="p">)]</span>
<span class="gp">   ....: </span><span class="p">)</span>
<span class="gp">   ....: </span>

<span class="gp">In [21]: </span><span class="n">res</span><span class="o">.</span><span class="n">head</span><span class="p">()</span> <span class="c1"># if 20 schools in area_1 offer course_1, the first cell is 20</span>
<span class="gh">Out[21]: </span>
<span class="go">          area_1  area_2  area_3  area_4</span>
<span class="go">course_1       0       0       0       0</span>
<span class="go">course_2       0       0       0       0</span>
<span class="go">course_3       0       0       0       0</span>
<span class="go">course_4       0       0       0       0</span>
<span class="go">course_5       0       0       0       0</span>
</pre></div>
</div>
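<p>One possible approach, sketched on a small made-up frame rather than the real CSV: split the Course strings, explode them to one course per row, drop repeated (Area, School, Course) combinations so each school counts once, and cross-tabulate. The sample rows and the name <code>long</code> are assumptions for illustration.</p>

```python
import pandas as pd

# Made-up rows in the same layout as school_course.csv.
df = pd.DataFrame({
    "Area": ["area_1", "area_1", "area_2", "area_1"],
    "School": ["school_1", "school_1", "school_2", "school_3"],
    "Course": ["course_1 course_2", "course_2", "course_1", "course_1"],
})

long = (
    df.assign(Course=df["Course"].str.split())   # "a b c" -> ["a", "b", "c"]
      .explode("Course")                          # one course per row
      .drop_duplicates(["Area", "School", "Course"])  # count each school once
)
counts = pd.crosstab(long["Course"], long["Area"])

# Reindex to the full 100 x 4 grid so absent combinations show up as 0.
courses = ["course_%d" % (i + 1) for i in range(100)]
areas = ["area_%d" % (i + 1) for i in range(4)]
res = counts.reindex(index=courses, columns=areas, fill_value=0)
print(res.head())
```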
</section>
<section id="ex6">
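<h2>Ex6: Capture the Row and Column Labels of Non-Zero Entries<a class="headerlink" href="#ex6" title="Permalink to this heading">#</a></h2>
<p>Given the DataFrame below, return the MultiIndex formed by the row and column label combinations of its non-zero entries. In the expected result res, each pair lists the column label first.</p>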
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [22]: </span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span>
<span class="gp">   ....: </span>    <span class="p">[[</span><span class="mi">0</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">0</span><span class="p">],[</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">],[</span><span class="mi">0</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">6</span><span class="p">],[</span><span class="mi">0</span><span class="p">,</span><span class="mi">9</span><span class="p">,</span><span class="mi">0</span><span class="p">]],</span>
<span class="gp">   ....: </span>    <span class="n">index</span><span class="o">=</span><span class="nb">list</span><span class="p">(</span><span class="s2">&quot;ABCD&quot;</span><span class="p">),</span> <span class="n">columns</span><span class="o">=</span><span class="nb">list</span><span class="p">(</span><span class="s2">&quot;XYZ&quot;</span><span class="p">))</span>
<span class="gp">   ....: </span>

<span class="gp">In [23]: </span><span class="n">df</span>
<span class="gh">Out[23]: </span>
<span class="go">   X  Y  Z</span>
<span class="go">A  0  5  0</span>
<span class="go">B  2  1  0</span>
<span class="go">C  0  0  6</span>
<span class="go">D  0  9  0</span>

<span class="gp">In [24]: </span><span class="n">res</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Index</span><span class="p">([</span>
<span class="gp">   ....: </span>    <span class="p">(</span><span class="s1">&#39;X&#39;</span><span class="p">,</span> <span class="s1">&#39;B&#39;</span><span class="p">),</span>
<span class="gp">   ....: </span>    <span class="p">(</span><span class="s1">&#39;Y&#39;</span><span class="p">,</span> <span class="s1">&#39;A&#39;</span><span class="p">),</span>
<span class="gp">   ....: </span>    <span class="p">(</span><span class="s1">&#39;Y&#39;</span><span class="p">,</span> <span class="s1">&#39;B&#39;</span><span class="p">),</span>
<span class="gp">   ....: </span>    <span class="p">(</span><span class="s1">&#39;Y&#39;</span><span class="p">,</span> <span class="s1">&#39;D&#39;</span><span class="p">),</span>
<span class="gp">   ....: </span>    <span class="p">(</span><span class="s1">&#39;Z&#39;</span><span class="p">,</span> <span class="s1">&#39;C&#39;</span><span class="p">)])</span>
<span class="gp">   ....: </span>

<span class="gp">In [25]: </span><span class="n">res</span>
<span class="gh">Out[25]: </span>
<span class="go">MultiIndex([(&#39;X&#39;, &#39;B&#39;),</span>
<span class="go">            (&#39;Y&#39;, &#39;A&#39;),</span>
<span class="go">            (&#39;Y&#39;, &#39;B&#39;),</span>
<span class="go">            (&#39;Y&#39;, &#39;D&#39;),</span>
<span class="go">            (&#39;Z&#39;, &#39;C&#39;)],</span>
<span class="go">           )</span>
</pre></div>
</div>
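<p>A minimal sketch of one way to obtain this index, assuming NumPy is acceptable: locate the non-zero positions with <code>np.nonzero</code> on the transposed values and map the positions back to labels. The variable names <code>col_pos</code> and <code>row_pos</code> are illustrative.</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[0, 5, 0], [2, 1, 0], [0, 0, 6], [0, 9, 0]],
    index=list("ABCD"), columns=list("XYZ"))

# np.nonzero scans in row-major order, so transposing first walks the
# original frame column by column, which matches the expected ordering.
col_pos, row_pos = np.nonzero(df.to_numpy().T)
res = pd.MultiIndex.from_arrays([df.columns[col_pos], df.index[row_pos]])
print(res)
```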
</section>
<section id="ex7">
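<h2>Ex7: Analyze Cluster Logs<a class="headerlink" href="#ex7" title="Permalink to this heading">#</a></h2>
<p>A company has built a distributed file cluster of 134 servers housed in five server rooms: 23 in R0, 16 in R1, 47 in R2, 30 in R3, and 18 in R4. Server numbering in each room starts from 001. Using a log collection tool, the operations staff obtained the cluster's file transfer history for September 27, 2022, shown below. Each line is structured as follows. The square brackets give the operation type, either sending a file to another server (PUSH) or receiving a file from another server (SAVE), together with the time of the operation. Cluster#R?#??? identifies the machine performing the operation; for example, Cluster#R4#014 is machine 14 in room R4. The ten-character string that follows is the unique identifier of the transferred file. If a SAVE record shows that a machine received file XXX, there is always a record of another machine PUSHing file XXX. The final field is the size of the file sent for a PUSH record and the size of the file received for a SAVE record. If the sizes in a matching PUSH/SAVE pair differ, the transfer ended in an unfinished state (Unfinished).</p>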
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [26]: </span><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s2">&quot;data/supplement/ex7/logs.txt&quot;</span><span class="p">,</span> <span class="s2">&quot;r&quot;</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
<span class="gp">   ....: </span>    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">txt</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">f</span><span class="o">.</span><span class="n">readlines</span><span class="p">()):</span>
<span class="gp">   ....: </span>        <span class="k">if</span> <span class="n">i</span> <span class="o">&gt;=</span> <span class="mi">5</span><span class="p">:</span>
<span class="gp">   ....: </span>            <span class="k">break</span>
<span class="gp">   ....: </span>        <span class="nb">print</span><span class="p">(</span><span class="n">txt</span><span class="o">.</span><span class="n">strip</span><span class="p">())</span>
<span class="gp">   ....: </span>
<span class="go">[PUSH|2022-09-27 07:08:12] Cluster#R1#007 | tgfuHOAjDJ | 4.41 GB</span>
<span class="go">[PUSH|2022-09-27 15:11:49] Cluster#R2#027 | AFHAvugnTR | 91.64 MB</span>
<span class="go">[PUSH|2022-09-27 06:54:02] Cluster#R3#016 | cJwLKcNsmA | 489.25 MB</span>
<span class="go">[SAVE|2022-09-27 08:17:00] Cluster#R0#019 | neLAGbGkvd | 7.99 GB</span>
<span class="go">[PUSH|2022-09-27 05:31:50] Cluster#R2#012 | rXZyuYLlEE | 730.45 MB</span>
</pre></div>
</div>
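<p>The line layout described above can be parsed with a single regular expression. The sketch below is only one possible approach: the first three lines come from the sample output, the fourth is a made-up dirty line, the column names are illustrative, and 1 GB = 1024 MB is an assumption that the exercise does not state.</p>

```python
import pandas as pd

lines = pd.Series([
    "[PUSH|2022-09-27 07:08:12] Cluster#R1#007 | tgfuHOAjDJ | 4.41 GB",
    "[PUSH|2022-09-27 15:11:49] Cluster#R2#027 | AFHAvugnTR | 91.64 MB",
    "[SAVE|2022-09-27 08:17:00] Cluster#R0#019 | neLAGbGkvd | 7.99 GB",
    "[SAVE|2022-09-27 25:61:00] Cluster#R0#019 | abcdefghij | 1.2.3 MB",  # made-up dirty line
])

# One capture group per field: operation, time, room, machine, file id, size, unit.
pattern = (
    r"\[(PUSH|SAVE)\|([^\]]*)\]\s*"
    r"Cluster#(R\d)#(\d{3})\s*\|\s*"
    r"(\w{10})\s*\|\s*(\S+)\s*(GB|MB)"
)
parsed = lines.str.extract(pattern)
parsed.columns = ["op", "time", "room", "machine", "file_id", "size", "unit"]

# errors="coerce" turns malformed timestamps and numbers into NaT / NaN.
parsed["time"] = pd.to_datetime(
    parsed["time"], format="%Y-%m-%d %H:%M:%S", errors="coerce")
parsed["size"] = pd.to_numeric(parsed["size"], errors="coerce")

# Assumed conversion factor: 1 GB = 1024 MB.
parsed["size_mb"] = parsed["size"] * parsed["unit"].map({"MB": 1, "GB": 1024})
print(parsed)
```

<p>Vectorized <code>str.extract</code> avoids a Python-level loop over lines, and coercing errors keeps dirty rows visible as missing values so that they can be inspected before being dropped.</p>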
<ul class="simple">
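<li><p>Use an efficient method to extract the information from the log, taking care to clean dirty data (such as malformed timestamps and invalid numbers), and store it in the format below, with rows ordered chronologically by push_time. Here file_id is the file's unique identifier, file_size is the actual file size, save_size is the size finally received, push_from is the server that PUSHed the file, and push_to is the server that SAVEd it.</p></li>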
</ul>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [27]: </span><span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span>
<span class="gp">   ....: </span>    <span class="p">{</span>
<span class="gp">   ....: </span>        <span class="s2">&quot;file_id&quot;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&quot;wfjqoIDhsD&quot;</span><span class="p">,</span> <span class="s2">&quot;QigjDSEGje&quot;</span><span class="p">,</span> <span class="s2">&quot;...&quot;</span><span class="p">],</span>
<span class="gp">   ....: </span>        <span class="s2">&quot;file_size&quot;</span><span class="p">:</span> <span class="p">[</span><span class="mf">6.35</span><span class="p">,</span> <span class="mf">149.23</span><span class="p">,</span> <span class="s2">&quot;...&quot;</span><span class="p">],</span>
<span class="gp">   ....: </span>        <span class="s2">&quot;save_size&quot;</span><span class="p">:</span> <span class="p">[</span><span class="mf">6.32</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">nan</span><span class="p">,</span> <span class="s2">&quot;...&quot;</span><span class="p">],</span> <span class="c1"># np.nan means not received</span>
<span class="gp">   ....: </span>        <span class="s2">&quot;push_from&quot;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&quot;A3-007&quot;</span><span class="p">,</span> <span class="s2">&quot;A0-017&quot;</span><span class="p">,</span> <span class="s2">&quot;...&quot;</span><span class="p">],</span>
<span class="gp">   ....: </span>        <span class="s2">&quot;push_to&quot;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&quot;A2-012&quot;</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">nan</span><span class="p">,</span> <span class="s2">&quot;...&quot;</span><span class="p">],</span> <span class="c1"># np.nan means not received</span>
<span class="gp">   ....: </span>        <span class="s2">&quot;push_time&quot;</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">([</span>
<span class="gp">   ....: </span>            <span class="s2">&quot;20220927 01:03:55&quot;</span><span class="p">,</span> <span class="s2">&quot;20220927 01:03:58&quot;</span><span class="p">,</span> <span class="n">pd</span><span class="o">.</span><span class="n">NaT</span><span class="p">]),</span>
<span class="gp">   ....: </span>        <span class="s2">&quot;save_time&quot;</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">([</span>
<span class="gp">   ....: </span>            <span class="s2">&quot;20220927 01:03:57&quot;</span><span class="p">,</span> <span class="n">pd</span><span class="o">.</span><span class="n">NaT</span><span class="p">,</span> <span class="n">pd</span><span class="o">.</span><span class="n">NaT</span><span class="p">]),</span>
<span class="gp">   ....: </span>    <span class="p">},</span>
<span class="gp">   ....: </span>    <span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="s2">&quot;...&quot;</span><span class="p">]</span>
<span class="gp">   ....: </span><span class="p">)</span> <span class="c1"># format reference only; not real data</span>
<span class="gp">   ....: </span>
<span class="gh">Out[27]: </span>
<span class="go">        file_id file_size save_size push_from push_to           push_time           save_time</span>
<span class="go">0    wfjqoIDhsD      6.35      6.32    A3-007  A2-012 2022-09-27 01:03:55 2022-09-27 01:03:57</span>
<span class="go">1    QigjDSEGje    149.23       NaN    A0-017     NaN 2022-09-27 01:03:58                 NaT</span>
<span class="go">...         ...       ...       ...       ...     ...                 NaT                 NaT</span>
</pre></div>
</div>
<div class="hint admonition">
<p class="admonition-title">Hint</p>
<blockquote>
<div><p>In essence, this joins two sets of file records that correspond one-to-one.</p>
</div></blockquote>
</div>
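<p>The join mentioned in the hint might look like the sketch below. It assumes the log has already been parsed into one row per record; the frame <code>records</code>, its column names, and its values are made up for illustration.</p>

```python
import pandas as pd

# Made-up parsed records, one row per log line (sizes in a common unit).
records = pd.DataFrame({
    "op": ["PUSH", "SAVE", "PUSH"],
    "file_id": ["aaaaaaaaaa", "aaaaaaaaaa", "bbbbbbbbbb"],
    "server": ["R3-007", "R2-012", "R0-017"],
    "size": [6.35, 6.32, 149.23],
    "time": pd.to_datetime([
        "2022-09-27 01:03:55", "2022-09-27 01:03:57", "2022-09-27 01:03:58"]),
})

push = (records[records["op"] == "PUSH"].drop(columns="op")
        .rename(columns={"server": "push_from", "size": "file_size",
                         "time": "push_time"}))
save = (records[records["op"] == "SAVE"].drop(columns="op")
        .rename(columns={"server": "push_to", "size": "save_size",
                         "time": "save_time"}))

# Left join keeps every PUSH; files never received get NaN / NaT on the SAVE side.
# validate="one_to_one" raises if a file_id is duplicated on either side.
res = (
    push.merge(save, on="file_id", how="left", validate="one_to_one")
        .sort_values("push_time")
        .reset_index(drop=True)
)
res = res[["file_id", "file_size", "save_size", "push_from", "push_to",
           "push_time", "save_time"]]
print(res)
```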
<ul class="simple">
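<li><p>In general, files transfer faster within a server room than across rooms. For all completed transfers, compute the average transfer speed (MB/s) grouped by the server rooms of the sending and receiving servers. Use the format below, where the entry in row i, column j is the average speed of completed transfers from room i to room j. Are the diagonal values of the matrix higher than the off-diagonal values?</p></li>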
</ul>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [28]: </span><span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span>
<span class="gp">   ....: </span>    <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">25</span><span class="p">)</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">),</span>
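<span class="gp">   ....: </span>    <span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s2">&quot;R</span><span class="si">%d</span><span class="s2">&quot;</span><span class="o">%</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">)],</span>
<span class="gp">   ....: </span>    <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s2">&quot;R</span><span class="si">%d</span><span class="s2">&quot;</span><span class="o">%</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">)],</span>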
<span class="gp">   ....: </span><span class="p">)</span> <span class="c1"># format reference only; not real data</span>
<span class="gp">   ....: </span>
<span class="gh">Out[28]: </span>
<span class="go">          R0        R1        R2        R3        R4</span>
<span class="go">R0  0.575946  0.929296  0.318569  0.667410  0.131798</span>
<span class="go">R1  0.716327  0.289406  0.183191  0.586513  0.020108</span>
<span class="go">R2  0.828940  0.004695  0.677817  0.270008  0.735194</span>
<span class="go">R3  0.962189  0.248753  0.576157  0.592042  0.572252</span>
<span class="go">R4  0.223082  0.952749  0.447125  0.846409  0.699479</span>
</pre></div>
</div>
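<p>A rough sketch of the speed matrix on made-up data. It assumes that sizes are already in MB, that a file's speed is its size divided by the seconds between push_time and save_time, and that the average is the plain mean of per-file speeds. None of these definitions is fixed by the exercise, and the server labels are hypothetical.</p>

```python
import pandas as pd

# Made-up joined records; sizes are assumed to be in MB.
df = pd.DataFrame({
    "file_size": [100.0, 300.0, 50.0, 80.0],
    "save_size": [100.0, 300.0, 50.0, 40.0],
    "push_from": ["R0-001", "R0-002", "R1-003", "R1-001"],
    "push_to": ["R0-005", "R1-001", "R0-002", "R1-004"],
    "push_time": pd.to_datetime(["2022-09-27 01:00:00"] * 4),
    "save_time": pd.to_datetime([
        "2022-09-27 01:00:10", "2022-09-27 01:01:00",
        "2022-09-27 01:00:25", "2022-09-27 01:00:08"]),
})

# Keep completed transfers only: received size equals sent size.
done = df[df["file_size"] == df["save_size"]].copy()
seconds = (done["save_time"] - done["push_time"]).dt.total_seconds()
done["speed"] = done["file_size"] / seconds          # MB per second
done["from_room"] = done["push_from"].str[:2]        # e.g. "R0"
done["to_room"] = done["push_to"].str[:2]

speed = done.pivot_table(
    index="from_room", columns="to_room", values="speed", aggfunc="mean")
print(speed)
```

<p>Room pairs with no completed transfer show up as NaN, which makes gaps in the data easy to spot when comparing diagonal and off-diagonal entries.</p>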
<ul class="simple">
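<li><p>As noted in the problem statement, not every transfer succeeds. A transfer is Finished if and only if the file size equals the received size. A file with only a single record in the log is in the Missed state. If the file size differs from the received size, the task is Unfinished. Unfinished transfers are further split into three tiers by the proportion transferred: "Unfinished-Almost" (above 90%), "Unfinished-Fair" (above 50%), and "Unfinished-Bad" (otherwise). Compute the proportion of each final status for each server room in the format below, sorted first by Status ("Finished" &gt; "Unfinished-Almost" &gt; "Unfinished-Fair" &gt; "Unfinished-Bad" &gt; "Missed") and then by room number.</p></li>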
</ul>
<div class="highlight-ipython notranslate"><div class="highlight"><pre><span></span><span class="gp">In [29]: </span><span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span>
<span class="gp">   ....: </span>    <span class="p">{</span>
<span class="gp">   ....: </span>        <span class="s2">&quot;Status&quot;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&quot;Finished&quot;</span><span class="p">]</span><span class="o">*</span><span class="mi">3</span> <span class="o">+</span> <span class="p">[</span><span class="s2">&quot;...&quot;</span><span class="p">]</span> <span class="o">+</span> <span class="p">[</span><span class="s2">&quot;Missed&quot;</span><span class="p">],</span>
<span class="gp">   ....: </span>        <span class="s2">&quot;Room&quot;</span><span class="p">:</span> <span class="p">[</span><span class="s2">&quot;R0&quot;</span><span class="p">,</span> <span class="s2">&quot;R1&quot;</span><span class="p">,</span> <span class="s2">&quot;R2&quot;</span><span class="p">,</span> <span class="s2">&quot;...&quot;</span><span class="p">,</span> <span class="s2">&quot;R4&quot;</span><span class="p">],</span>
<span class="gp">   ....: </span>        <span class="s2">&quot;Ratio&quot;</span><span class="p">:</span> <span class="p">[</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.15</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span> <span class="s2">&quot;...&quot;</span><span class="p">,</span> <span class="mf">0.05</span><span class="p">],</span>
<span class="gp">   ....: </span>    <span class="p">},</span>
<span class="gp">   ....: </span>    <span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="s2">&quot;...&quot;</span><span class="p">,</span><span class="mi">24</span><span class="p">]</span>
<span class="gp">   ....: </span><span class="p">)</span>
<span class="gp">   ....: </span>
<span class="gh">Out[29]: </span>
<span class="go">       Status Room Ratio</span>
<span class="go">0    Finished   R0  0.10</span>
<span class="go">1    Finished   R1  0.15</span>
<span class="go">2    Finished   R2  0.10</span>
<span class="go">...       ...  ...   ...</span>
<span class="go">24     Missed   R4  0.05</span>
</pre></div>
</div>
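<p>The status rules above can be sketched with <code>np.select</code> and an ordered categorical. This is only an illustration on made-up data: it assumes the room is taken from the sending server (push_from) and that ratios are normalized within each room, neither of which the exercise spells out.</p>

```python
import numpy as np
import pandas as pd

# Made-up joined records; NaN in save_size means no SAVE record exists.
df = pd.DataFrame({
    "file_size": [100.0, 100.0, 100.0, 100.0, 100.0, 100.0],
    "save_size": [100.0, 95.0, 60.0, 10.0, np.nan, 100.0],
    "push_from": ["R0-001", "R0-002", "R0-003", "R0-004", "R1-001", "R1-002"],
})

levels = ["Finished", "Unfinished-Almost", "Unfinished-Fair",
          "Unfinished-Bad", "Missed"]
ratio = df["save_size"] / df["file_size"]

# Conditions are checked in order; the first match wins.
status = np.select(
    [ratio.isna(), ratio.eq(1), ratio.gt(0.9), ratio.gt(0.5)],
    ["Missed", "Finished", "Unfinished-Almost", "Unfinished-Fair"],
    default="Unfinished-Bad",
)
df["Status"] = pd.Categorical(status, categories=levels, ordered=True)
df["Room"] = df["push_from"].str[:2]  # assumption: room of the sender

# observed=False keeps every status for every room, with count 0 if absent.
counts = df.groupby(["Status", "Room"], observed=False).size()
share = counts / counts.groupby(level="Room").transform("sum")
res = share.rename("Ratio").reset_index()
print(res)
```

<p>Because Status is an ordered categorical, the grouped result already follows the required status order, with rooms sorted inside each status.</p>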
<ul class="simple">
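<li><p>For each hour, compute the difference between the number of large files sent and the number of large files received by each server room, where a large file is one larger than 800 MB. The row index of the result is the time and the column index is the server room.</p></li>
<li><p>For each hour, compute the idle rate of each machine. For a given machine, idle time is time during which it is neither sending nor receiving, and the hourly idle rate is the share of idle time within that hour. The row index of the result is the time and the column index is the machine.</p></li>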
</ul>
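<p>For the first of the two tasks above, a possible sketch on made-up data is shown below. It assumes that sizes are in MB, that a send is counted in the hour of push_time, and that a receive is counted in the hour of save_time; the exercise does not pin these choices down.</p>

```python
import pandas as pd

# Made-up joined records; sizes are assumed to be in MB.
df = pd.DataFrame({
    "file_size": [900.0, 1200.0, 100.0, 850.0],
    "push_from": ["R0-001", "R1-002", "R0-003", "R0-001"],
    "push_to": ["R1-001", "R0-002", "R1-004", "R1-002"],
    "push_time": pd.to_datetime([
        "2022-09-27 01:10:00", "2022-09-27 01:50:00",
        "2022-09-27 01:20:00", "2022-09-27 02:30:00"]),
    "save_time": pd.to_datetime([
        "2022-09-27 01:20:00", "2022-09-27 02:05:00",
        "2022-09-27 01:25:00", "2022-09-27 02:40:00"]),
})

big = df[df["file_size"].gt(800)]  # large files only

# Count sends by (push hour, sender room) and receives by (save hour, receiver room).
sent = pd.crosstab(big["push_time"].dt.floor("h"), big["push_from"].str[:2])
recv = pd.crosstab(big["save_time"].dt.floor("h"), big["push_to"].str[:2])

# Align both tables; a cell missing from either side counts as 0.
res = sent.sub(recv, fill_value=0).fillna(0).astype(int)
print(res)
```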
<div class="hint admonition">
<p class="admonition-title">Hint</p>
<blockquote>
<div><p>The idle-rate task involves the <a class="reference external" href="https://leetcode.cn/problems/merge-intervals/">merge intervals problem</a>. The pandas interval index does not define a function comparable to merge intervals; read <a class="reference external" href="https://stackoverflow.com/questions/57882621/efficient-merge-overlapping-intervals-in-same-pandas-dataframe-with-start-and-fi">this answer</a> to think about how to implement it.</p>
</div></blockquote>
</div>
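<p>The interval merging mentioned in the hint can be sketched with a running maximum of end times. The example below uses made-up busy intervals for a single machine and is not necessarily the approach taken in the linked answer; the names <code>busy</code>, <code>block</code>, and <code>merged</code> are illustrative.</p>

```python
import pandas as pd

# Made-up busy intervals (sending or receiving) of one machine.
busy = pd.DataFrame({
    "start": pd.to_datetime([
        "2022-09-27 01:00:00", "2022-09-27 01:05:00",
        "2022-09-27 01:20:00", "2022-09-27 01:40:00"]),
    "end": pd.to_datetime([
        "2022-09-27 01:30:00", "2022-09-27 01:10:00",
        "2022-09-27 01:35:00", "2022-09-27 01:50:00"]),
}).sort_values("start")

# Latest end time seen among all earlier intervals (NaT for the first row).
prev_end = busy["end"].cummax().shift()
# A new block starts when an interval begins after everything before it ended.
block = busy["start"].gt(prev_end).cumsum().rename("block")

merged = busy.groupby(block).agg(start=("start", "min"), end=("end", "max"))
total_busy = (merged["end"] - merged["start"]).sum()
print(merged)
print(total_busy)
```

<p>Using the running maximum rather than the previous row's end matters when a short interval is nested inside a longer one, as with the second interval here. The idle time in an hour would then be the hour length minus the merged busy time falling inside that hour.</p>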
</section>
</section>


              </article>
              

              
          </div>
          
      </div>
    </div>

  
  
  <!-- Scripts loaded after <body> so the DOM is not blocked -->
  <script src="_static/scripts/pydata-sphinx-theme.js?digest=92025949c220c2e29695"></script>

<footer class="bd-footer"><div class="bd-footer__inner container">
  
  <div class="footer-item">
    <p class="copyright">
    &copy; Copyright 2020-2022, Datawhale, 耿远昊.<br>
</p>
  </div>
  
  <div class="footer-item">
    <p class="sphinx-version">
Created using <a href="http://sphinx-doc.org/">Sphinx</a> 5.0.2.<br>
</p>
  </div>
  
</div>
</footer>
  </body>
</html>